A Florida health insurance company wants to predict annual claims for individual clients. The company pulls a random sample of 50 customers. The owner wishes to charge an actuarially fair premium to ensure a normal rate of return. The owner collects all of the sampled customers’ health care expenses from the last year and compares them with what is known about each customer’s plan.

The data on the 50 customers in the sample is as follows:

  • Charges: Total medical expenses for a particular insurance plan (in dollars)
  • Age: Age of the primary beneficiary
  • BMI: Primary beneficiary’s body mass index (kg/m²)
  • Female: Primary beneficiary’s birth sex (0 = Male, 1 = Female)
  • Children: Number of children covered by health insurance plan (includes other dependents as well)
  • Smoker: Indicator if primary beneficiary is a smoker (0 = non-smoker, 1 = smoker)
  • Cities: Dummy variables for each city with the default being Sanford

Answer the following questions using complete sentences and attach all output, plots, etc. within this report.

Question 1

Randomly select three observations from the sample and exclude from all modeling (i.e. n=47). Provide the summary statistics (min, max, std, mean, median) of the quantitative variables for the 47 observations.

Table: Summary of Quantitative Variables (except Children) for the 47 Observations (N = 47)

Characteristic  Mean (SD)        Median (IQR)           Range
Charges         12,317 (11,498)  8,604 (4,480, 13,552)  2,494, 55,135
Age             42 (13)          43 (30, 53)            23, 64
BMI             29.0 (5.6)       28.5 (25.3, 32.4)      16.8, 42.1

Children Summary Data
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.234   2.000   5.000
Children Standard Deviation
## [1] 1.18345
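
The hold-out split and summary statistics above could be produced along the following lines. This is a minimal sketch, not the code behind the report: the data frame `insurance` and all of its values are simulated stand-ins, and only the names `insurance.new` and `insurance.test` come from the report itself.

```r
# Hypothetical stand-in for the 50-customer sample (all values are simulated).
set.seed(42)
insurance <- data.frame(
  Charges  = round(rlnorm(50, meanlog = 9, sdlog = 0.8)),
  Age      = sample(23:64, 50, replace = TRUE),
  BMI      = round(rnorm(50, mean = 29, sd = 5.6), 1),
  Children = sample(0:5, 50, replace = TRUE)
)

# Randomly withhold three rows for later testing; model on the other 47.
test_rows      <- sample(nrow(insurance), 3)
insurance.test <- insurance[test_rows, ]
insurance.new  <- insurance[-test_rows, ]

# Min, max, standard deviation, mean, and median for each quantitative variable.
summary_stats <- sapply(insurance.new, function(x) {
  c(min = min(x), max = max(x), sd = sd(x), mean = mean(x), median = median(x))
})
print(round(t(summary_stats), 2))
```

Setting the seed before sampling keeps the three withheld observations the same each time the report is knitted.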

Question 2

Provide the correlation between all quantitative variables
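
A correlation matrix for the quantitative variables can be obtained with base R's `cor()`. The sketch below uses simulated data under assumed relationships (the data frame `quant` and its values are hypothetical), so its output is not the report's answer to this question.

```r
# Simulated stand-in for the 47 modeling observations (hypothetical values).
set.seed(1)
n        <- 47
Age      <- sample(23:64, n, replace = TRUE)
BMI      <- rnorm(n, mean = 29, sd = 5.6)
Children <- sample(0:5, n, replace = TRUE)
Charges  <- exp(7 + 0.035 * Age + 0.01 * BMI + rnorm(n, sd = 0.4))

quant <- data.frame(Charges, Age, BMI, Children)

# Pearson correlations between every pair of quantitative variables.
cor_matrix <- cor(quant)
print(round(cor_matrix, 2))
```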

Question 3

Run a regression that includes all independent variables in the data table. Does the model above violate any of the Gauss-Markov assumptions? If so, what are they and what is the solution for correcting?

Summary Regression Output of all independent variables (n=47)
## 
## Call:
## lm(formula = Charges ~ ., data = insurance.new)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11888  -2726  -1065    711  20257 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -14022.39    6563.47  -2.136 0.039145 *  
## Age              287.26      77.04   3.729 0.000626 ***
## BMI              434.97     200.14   2.173 0.036058 *  
## Female           858.33    2120.59   0.405 0.687923    
## Children         118.17     873.64   0.135 0.893122    
## Smoker         23108.13    3009.97   7.677 3.04e-09 ***
## WinterSprings  -1659.04    3069.60  -0.540 0.592024    
## WinterPark     -4853.57    3009.55  -1.613 0.115080    
## Oviedo         -3769.38    2566.29  -1.469 0.150115    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6722 on 38 degrees of freedom
## Multiple R-squared:  0.7176, Adjusted R-squared:  0.6582 
## F-statistic: 12.07 on 8 and 38 DF,  p-value: 2.224e-08
Plot of Insurance Data and Scatterplot Matrix of all Quantitative Variables

We ran a regression model with all independent variables and found the following violations of the Gauss-Markov theorem assumptions:

3rd Assumption - Linearity Is Violated (Residuals vs. Fitted plot; functional form).
- Consider using ratios or percentages rather than raw data (see module on multicollinearity for a complete discussion of the associated problems and causes).

4th Assumption - Heteroskedasticity Is Present (Scale-Location plot).
- There is a cluster of observations between fitted values of roughly 2,500 and 15,000, after which the residuals fan outward. This results in inefficient coefficient estimates for this cross-sectional data.

6th Assumption - Errors Are Not Normally Distributed (Normal Q-Q plot).
- Look for subgroups in the data and analyze them separately; use summary data (like the mean value) rather than the raw data.
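
The plot-based checks above can be complemented with simple numeric checks. The sketch below is an illustration on simulated data (all values and the reduced set of predictors are assumptions): it applies a Shapiro-Wilk test to the residuals and an informal auxiliary regression of squared residuals on fitted values, which is a rough stand-in for a formal heteroskedasticity test rather than the method used in the report.

```r
# Simulated data where charges are log-linear, so a model in raw dollars
# should show the kinds of problems described above (hypothetical values).
set.seed(7)
n       <- 47
Age     <- sample(23:64, n, replace = TRUE)
Smoker  <- rbinom(n, 1, 0.2)
Charges <- exp(7 + 0.035 * Age + 1.3 * Smoker + rnorm(n, sd = 0.4))

model <- lm(Charges ~ Age + Smoker)

# Normality of errors: a small p-value suggests non-normal residuals.
normality <- shapiro.test(residuals(model))

# Informal heteroskedasticity check: do squared residuals grow with fitted values?
res_sq <- residuals(model)^2
fit    <- fitted(model)
aux    <- lm(res_sq ~ fit)
aux_p  <- summary(aux)$coefficients["fit", "Pr(>|t|)"]

cat("Shapiro-Wilk p-value:", signif(normality$p.value, 3), "\n")
cat("Auxiliary regression slope p-value:", signif(aux_p, 3), "\n")
```

In an interactive session, `plot(model, which = 1:3)` draws the Residuals vs Fitted, Normal Q-Q, and Scale-Location plots referred to above.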

Question 4

Implement the solutions from question 3, such as data transformation, along with any other changes you wish. Use the sample data and run a new regression. How have the fit measures changed? How have the signs and significance of the coefficients changed?

Scatterplot Matrices for the Log of Charges and the Insurance Data without Dummy Variables

Summary Regression Model with Log of Charges
## 
## Call:
## lm(formula = LogCharges ~ ., data = insurance.new[, c(10, 2:9)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65510 -0.14862 -0.05322  0.03263  1.28444 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.033276   0.387771  18.138  < 2e-16 ***
## Age            0.034991   0.004552   7.688 2.94e-09 ***
## BMI            0.011547   0.011824   0.977    0.335    
## Female         0.054880   0.125285   0.438    0.664    
## Children       0.063550   0.051615   1.231    0.226    
## Smoker         1.324284   0.177829   7.447 6.16e-09 ***
## WinterSprings -0.007282   0.181353  -0.040    0.968    
## WinterPark    -0.051822   0.177804  -0.291    0.772    
## Oviedo        -0.144341   0.151617  -0.952    0.347    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3972 on 38 degrees of freedom
## Multiple R-squared:  0.7924, Adjusted R-squared:  0.7487 
## F-statistic: 18.13 on 8 and 38 DF,  p-value: 8.493e-11
Model: Age with a logarithmic Shape
## 
## Call:
## lm(formula = LogCharges ~ ., data = insurance_LogChrgAgeWDummy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58853 -0.17786 -0.05451  0.02616  1.27653 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.17330    0.73015   4.346 9.98e-05 ***
## LogAge         1.42556    0.18545   7.687 2.95e-09 ***
## BMI            0.01451    0.01178   1.232    0.225    
## Female         0.06560    0.12535   0.523    0.604    
## Children       0.05664    0.05168   1.096    0.280    
## Smoker         1.32511    0.17782   7.452 6.07e-09 ***
## WinterSprings -0.02476    0.18155  -0.136    0.892    
## WinterPark    -0.07879    0.17815  -0.442    0.661    
## Oviedo        -0.14899    0.15168  -0.982    0.332    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3972 on 38 degrees of freedom
## Multiple R-squared:  0.7924, Adjusted R-squared:  0.7487 
## F-statistic: 18.13 on 8 and 38 DF,  p-value: 8.507e-11

Plots for Model results for Log of Charges and Log of Age

Scatterplot Matrix of Log of Charges/Log of Age compared to Log of Charges and Age Squared

Model: Age with a Quadratic Relationship
## 
## Call:
## lm(formula = LogCharges ~ ., data = insurance.new[, c(12, 2:10)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.62995 -0.14987 -0.05370  0.02717  1.28495 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.7322920  0.9363606   7.190 1.59e-08 ***
## AgeSq         -0.0001643  0.0004640  -0.354    0.725    
## Age            0.0492269  0.0404749   1.216    0.232    
## BMI            0.0124770  0.0122478   1.019    0.315    
## Female         0.0605778  0.1277695   0.474    0.638    
## Children       0.0598072  0.0532787   1.123    0.269    
## Smoker         1.3245151  0.1799132   7.362 9.39e-09 ***
## WinterSprings -0.0149998  0.1847672  -0.081    0.936    
## WinterPark    -0.0626046  0.1824473  -0.343    0.733    
## Oviedo        -0.1470754  0.1535865  -0.958    0.344    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4018 on 37 degrees of freedom
## Multiple R-squared:  0.7931, Adjusted R-squared:  0.7428 
## F-statistic: 15.76 on 9 and 37 DF,  p-value: 3.566e-10

Summary Model for BMI with a Logarithmic shape
## 
## Call:
## lm(formula = LogCharges ~ ., data = insurance.new[, c(2, 4:10, 
##     13)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65610 -0.15185 -0.05397  0.02865  1.27595 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.313609   1.106364   5.707 1.44e-06 ***
## Age            0.034944   0.004564   7.656 3.25e-09 ***
## Female         0.056410   0.125504   0.449    0.656    
## Children       0.064999   0.051857   1.253    0.218    
## Smoker         1.323267   0.177896   7.438 6.32e-09 ***
## WinterSprings -0.005992   0.181873  -0.033    0.974    
## WinterPark    -0.045362   0.176489  -0.257    0.799    
## Oviedo        -0.140444   0.151103  -0.929    0.359    
## LogBMI         0.314013   0.330373   0.950    0.348    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3974 on 38 degrees of freedom
## Multiple R-squared:  0.7921, Adjusted R-squared:  0.7484 
## F-statistic:  18.1 on 8 and 38 DF,  p-value: 8.695e-11

Plot of Model for Log of Charges with Dummy Variables

Scatterplot Matrix of Log of Charges/Log of BMI compared to Log of Charges and BMI Squared

Model: BMI with a Quadratic Relationship
## 
## Call:
## lm(formula = LogCharges ~ ., data = insurance.new[, c(2:10, 14)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65467 -0.14654 -0.04853  0.03424  1.28639 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.111e+00  1.356e+00   5.245 6.61e-06 ***
## Age            3.502e-02  4.643e-03   7.543 5.43e-09 ***
## BMI            6.116e-03  9.201e-02   0.066    0.947    
## Female         5.393e-02  1.280e-01   0.422    0.676    
## Children       6.287e-02  5.354e-02   1.174    0.248    
## Smoker         1.324e+00  1.802e-01   7.349 9.77e-09 ***
## WinterSprings -8.169e-03  1.844e-01  -0.044    0.965    
## WinterPark    -5.396e-02  1.837e-01  -0.294    0.771    
## Oviedo        -1.452e-01  1.542e-01  -0.941    0.353    
## BMISq          9.296e-05  1.562e-03   0.060    0.953    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4025 on 37 degrees of freedom
## Multiple R-squared:  0.7924, Adjusted R-squared:  0.7419 
## F-statistic: 15.69 on 9 and 37 DF,  p-value: 3.779e-10

When applying the solutions for the Gauss-Markov Assumptions that were violated we calculated and compared the following:

  1. Log of Charges
  2. Log of Charges and Log of Age
  3. Log of Charges and Age Squared
  4. Log of Charges and Log of BMI
  5. Log of Charges and BMI squared
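
The five specifications listed above could be fit and compared in one pass, roughly as follows. This is a sketch on simulated data with a reduced set of predictors (the data frame `d` and its values are hypothetical). It writes the transformations inside the formulas with `log()` and `I()`, whereas the report appears to store them as extra columns such as `AgeSq` and `BMISq`.

```r
# Simulated stand-in for the modeling data (hypothetical values).
set.seed(3)
n <- 47
d <- data.frame(
  Age    = sample(23:64, n, replace = TRUE),
  BMI    = rnorm(n, mean = 29, sd = 5.6),
  Smoker = rbinom(n, 1, 0.2)
)
d$Charges    <- exp(7 + 0.035 * d$Age + 0.01 * d$BMI + 1.3 * d$Smoker +
                    rnorm(n, sd = 0.4))
d$LogCharges <- log(d$Charges)

# One formula per specification; I() protects the squared terms.
specs <- list(
  log_charges = LogCharges ~ Age + BMI + Smoker,
  log_age     = LogCharges ~ log(Age) + BMI + Smoker,
  age_sq      = LogCharges ~ Age + I(Age^2) + BMI + Smoker,
  log_bmi     = LogCharges ~ Age + log(BMI) + Smoker,
  bmi_sq      = LogCharges ~ Age + BMI + I(BMI^2) + Smoker
)
fits <- lapply(specs, lm, data = d)

# Collect R-squared, adjusted R-squared, and SEE for each model.
fit_table <- t(sapply(fits, function(m) {
  s <- summary(m)
  c(r2 = s$r.squared, adj_r2 = s$adj.r.squared, see = s$sigma)
}))
print(round(fit_table, 4))
```

Because all five models share the same dependent variable, their fit measures can be compared with one another directly.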

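Overall, the measures of fit improved for each regression. R-squared and adjusted R-squared increased from about 72% and 66% in the original model to about 79% and 74% to 75% in all five transformed models. The SEE fell from 6722 to around 0.40, but these two figures are not directly comparable: the original model measures errors in dollars, while the transformed models measure them in log dollars.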

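Below are the changes in coefficient significance and sign relative to the original model. None of the original variables changed sign in any of the five models: Age, BMI, Female, Children, and Smoker remain positive, and the three city dummies remain negative.

1. Log of Charges
- BMI is no longer significant (p = 0.335)
- Smoker remains highly significant, with a slightly lower t value (7.447 vs. 7.677)
- Age is now much more significant (t value of 7.688 vs. 3.729)

2. Log of Charges and Log of Age
- BMI is no longer significant (p = 0.225)
- Smoker remains highly significant (t value of 7.452)
- Log of Age is much more significant than Age was in the original model (t value of 7.687 vs. 3.729)

3. Log of Charges and Age Squared
- BMI and Age are no longer significant, and Age Squared is not significant (p = 0.725) and has a negative sign
- Smoker remains highly significant (t value of 7.362)

4. Log of Charges and Log of BMI
- Log of BMI is not significant (p = 0.348)
- Age is much more significant (t value of 7.656), and Smoker remains highly significant (t value of 7.438)

5. Log of Charges and BMI squared
- Neither BMI (p = 0.947) nor BMI Squared (p = 0.953) is significant
- Age is much more significant (t value of 7.543), and Smoker remains highly significant (t value of 7.349)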

Question 5

Use the 3 withheld observations and calculate the performance measures for your best two models. Which is the better model? (remember that “better” depends on whether your outlook is short or long run)

library(magrittr)  # provides the %>% pipe used below (dplyr also loads it)

# Add the transformed variables the models expect to the three withheld observations
insurance.test$LogCharges <- log(insurance.test$Charges)
insurance.test$BMISq <- insurance.test$BMI^2
insurance.test$AgeSq <- insurance.test$Age^2

# The original (untransformed) model predicts Charges directly
insurance.test$bad_model_pred <- predict(model, newdata = insurance.test)

# The log models predict LogCharges, so exponentiate to get back to dollars
insurance.test$model_1_pred <- predict(model_LogChrgBMISq, newdata = insurance.test) %>% exp()
insurance.test$model_2_pred <- predict(model_LogChrgAgeSq, newdata = insurance.test) %>% exp()

# Finding the error (prediction minus actual)
insurance.test$error_bm <- insurance.test$bad_model_pred - insurance.test$Charges
insurance.test$error_1 <- insurance.test$model_1_pred - insurance.test$Charges
insurance.test$error_2 <- insurance.test$model_2_pred - insurance.test$Charges
Bias for the Bad Model, Model 1, & Model 2
## [1] 2096.91
## [1] 240.616
## [1] 356.8711
MAE for the Bad Model, Model 1, & Model 2
## [1] 5282.157
## [1] 412.3407
## [1] 512.8377
RMSE for the Bad Model, Model 1, & Model 2
## [1] 6720.431
## [1] 429.0247
## [1] 584.066
MAPE for Bad Model, Model 1, & Model 2
## [1] 0.6206971
## [1] 0.07086708
## [1] 0.07259645

The initial model performed the worst of the three on every measure. Of the two log models, Model 1 (Log of Charges with BMI squared) has the lower bias, MAE, RMSE, and MAPE, so it made smaller prediction errors than Model 2 (Log of Charges with Age squared) on the three withheld observations. The MAPE values are close (7.1% versus 7.3%), but because Model 1 is better on all four measures, it is the better model whether your outlook is short run or long run.
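
The four performance measures can be computed from a vector of errors as sketched below. The function name `perf_measures` and the example numbers are hypothetical, and the definitions (bias as the mean error, MAPE as a proportion rather than a percentage) are assumptions chosen to match the scale of the output above.

```r
# Performance measures for a set of predictions (hypothetical helper).
perf_measures <- function(predicted, actual) {
  error <- predicted - actual          # same sign convention as the report
  c(
    bias = mean(error),                # average signed error
    mae  = mean(abs(error)),           # mean absolute error
    rmse = sqrt(mean(error^2)),        # root mean squared error
    mape = mean(abs(error) / actual)   # mean absolute percentage error (proportion)
  )
}

# Illustrative values only; these are not the withheld observations.
actual    <- c(5000, 12000, 30000)
predicted <- c(5400, 11500, 31000)

measures <- perf_measures(predicted, actual)
print(round(measures, 4))
```

RMSE penalizes large individual misses more heavily than MAE, which is why the two can rank models differently.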

Question 6

Provide interpretations of the coefficients, do the signs make sense? Perform marginal change analysis (thing 2) on the independent variables.

Summary model for Log of Charges and BMI Squared
## 
## Call:
## lm(formula = LogCharges ~ ., data = insurance.new[, c(2:10, 14)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65467 -0.14654 -0.04853  0.03424  1.28639 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.111e+00  1.356e+00   5.245 6.61e-06 ***
## Age            3.502e-02  4.643e-03   7.543 5.43e-09 ***
## BMI            6.116e-03  9.201e-02   0.066    0.947    
## Female         5.393e-02  1.280e-01   0.422    0.676    
## Children       6.287e-02  5.354e-02   1.174    0.248    
## Smoker         1.324e+00  1.802e-01   7.349 9.77e-09 ***
## WinterSprings -8.169e-03  1.844e-01  -0.044    0.965    
## WinterPark    -5.396e-02  1.837e-01  -0.294    0.771    
## Oviedo        -1.452e-01  1.542e-01  -0.941    0.353    
## BMISq          9.296e-05  1.562e-03   0.060    0.953    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4025 on 37 degrees of freedom
## Multiple R-squared:  0.7924, Adjusted R-squared:  0.7419 
## F-statistic: 15.69 on 9 and 37 DF,  p-value: 3.779e-10

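Breakdown of Coefficient Slopes

  • Age: charges increase as age increases. Because the dependent variable is the log of charges, each additional year of age is associated with roughly 3.5% higher charges. The positive sign makes sense.
  • BMI: as BMI increases, so do charges (both BMI and BMISq are positive), although neither term is statistically significant.
  • Female: charges are higher if the client is female, which makes sense given pregnancy-related charges, although the coefficient is not statistically significant.
  • Cities: all three city coefficients are negative, so charges are lower than in the default city, Sanford, although none is statistically significant.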

Marginal Change Analysis

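Using a confidence level of 95%, the results below would occur for a one-unit change in each variable. Because the dependent variable is the log of charges, the coefficients are changes in log charges (approximately proportional changes in charges), not dollar amounts.
- If a person’s Age increases by 1 year, their log charges would increase by 0.04 give or take 0.01, which is roughly 3.5% higher charges.
- If a person is a smoker, their log charges would be higher by 1.32 give or take 0.38, which means charges about 3.7 times those of a comparable non-smoker (exp(1.32) ≈ 3.74).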
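
Converting log-scale coefficients into percentage changes can be sketched as follows. The estimates and standard errors are copied from the BMI Squared model output above; the use of `qt()` with 37 residual degrees of freedom for the 95% margin is an assumption about how the margin is formed, so the bounds may differ slightly from the rounded figures quoted in the text.

```r
# Estimates and standard errors from the Log of Charges / BMI Squared model.
coefs    <- c(Age = 0.03502, Smoker = 1.324)
ses      <- c(Age = 0.004643, Smoker = 0.1802)
df_resid <- 37

# Two-sided 95% critical value from the t distribution.
t_crit <- qt(0.975, df_resid)

# A coefficient b on the log scale implies a (exp(b) - 1) * 100 percent change.
pct <- function(b) (exp(b) - 1) * 100

result <- data.frame(
  estimate_pct = pct(coefs),
  lower_pct    = pct(coefs - t_crit * ses),
  upper_pct    = pct(coefs + t_crit * ses),
  row.names    = names(coefs)
)
print(round(result, 1))
```

For small coefficients such as Age, the percentage change is close to 100 times the coefficient; for large ones such as Smoker, the exponential conversion matters a great deal.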

Question 7

An eager insurance representative comes back with five potential clients. Using the better of the two models selected above, provide the prediction intervals for the five potential clients using the information provided by the insurance rep.

Customer  Age  BMI  Female  Children  Smoker  City
1         60   22   1       0         0       Oviedo
2         40   30   0       1         0       Sanford
3         25   25   0       0         1       Winter Park
4         33   35   1       2         0       Winter Springs
5         45   27   1       3         0       Oviedo
## 
## Call:
## lm(formula = LogCharges ~ ., data = insurance.new[, c(2:10, 14)])
## 
## Coefficients:
##   (Intercept)            Age            BMI         Female       Children  
##     7.111e+00      3.502e-02      6.116e-03      5.393e-02      6.287e-02  
##        Smoker  WinterSprings     WinterPark         Oviedo          BMISq  
##     1.324e+00     -8.169e-03     -5.396e-02     -1.452e-01      9.296e-05
##         fit      lwr      upr
## 1 10940.686 4345.449 27545.74
## 2  6915.164 2941.044 16259.36
## 3 12933.267 4787.337 34939.97
## 4  6410.797 2541.912 16168.27
## 5  8240.879 3380.672 20088.34
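
Prediction intervals like those above can be obtained with `predict()` and `interval = "prediction"`, then exponentiated to return to dollars. The sketch below is fit to simulated data with only two predictors (the data frames `train` and `clients` and all values are hypothetical), so its numbers will not match the report's.

```r
# Simulated training data on the log scale (hypothetical values).
set.seed(11)
n     <- 47
train <- data.frame(
  Age    = sample(23:64, n, replace = TRUE),
  Smoker = rbinom(n, 1, 0.2)
)
train$LogCharges <- 7 + 0.035 * train$Age + 1.3 * train$Smoker +
                    rnorm(n, sd = 0.4)

fit <- lm(LogCharges ~ Age + Smoker, data = train)

# Three illustrative new clients.
clients <- data.frame(Age = c(60, 40, 25), Smoker = c(0, 0, 1))

# 95% prediction intervals on the log scale, then back-transformed to dollars.
log_pi    <- predict(fit, newdata = clients, interval = "prediction", level = 0.95)
dollar_pi <- exp(log_pi)
width     <- dollar_pi[, "upr"] - dollar_pi[, "lwr"]

print(round(cbind(dollar_pi, width), 2))
```

Because the bounds are exponentiated, the interval is not symmetric around the fitted value in dollars, and its dollar width grows with the size of the prediction.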

Question 8

The owner notices that some of the predictions are wider than others, explain why.

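The widest prediction interval among the five customers belongs to customer #3, a 25-year-old male smoker with no children living in Winter Park. The second widest belongs to customer #1, a 60-year-old female with no children living in Oviedo.
- Age and Smoker have the most significant effect on Charges, and these two customers sit at the extremes of those variables: customer #3 is the only smoker of the five and is near the minimum age in the sample (23), while customer #1 is near the maximum age in the sample (64). Prediction intervals widen as a customer’s characteristics move away from the sample averages. In addition, the intervals are computed on the log scale and then converted back to dollars, so a higher predicted charge produces a wider interval in dollar terms.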

Question 9

Are there any prediction problems that occur with the five potential clients? If so, explain.

No prediction problems occur with the five potential clients. Each client’s Age, BMI, and number of children fall within the ranges observed in the sample, so none of the predictions requires extrapolation, and the relationships between Charges and both Age and Smoker are significant. The one potential problem is Customer #4 under our Model 1 (R-squared of about 79%): Customer #4 has the highest BMI of the group (35), which is above the sample mean (29.0) and median (28.5), so this customer could be an outlier.